Text Reuse Detection using a Composition of Text Similarity Measures
نویسندگان
چکیده
Detecting text reuse is a fundamental requirement for a variety of tasks and applications, ranging from journalistic text reuse to plagiarism detection. Text reuse is traditionally detected by computing similarity between a source text and a possibly reused text. However, existing text similarity measures exhibit a major limitation: They compute similarity only on features which can be derived from the content of the given texts, thereby inherently implying that any other text characteristics are negligible. In this paper, we overcome this traditional limitation and compute similarity along three characteristic dimensions inherent to texts: content, structure, and style. We explore and discuss possible combinations of measures along these dimensions, and our results demonstrate that the composition consistently outperforms previous approaches on three standard evaluation datasets, and that text reuse detection greatly benefits from incorporating a diverse feature set that reflects a wide variety of text characteristics. TITLE AND ABSTRACT IN GERMAN Erkennung von Textwiederverwendung durch Komposition von Textähnlichkeitsmaßen Die Frage, ob und in welcher Weise Texte in abgewandelter Form wiederverwendet werden, ist ein zentraler Aspekt bei einer Reihe von Problemstellungen, etwa im Rahmen journalistischer Tätigkeit oder als Mittel zur Plagiatserkennung. Textwiederverwendung wird traditionell ermittelt durch Berechnen von Textähnlichkeit zwischen einem Ursprungstext und einem potentiell wiederverwendeten Text. Bestehende Textähnlichkeitsmaße haben jedoch die starke Einschränkung, dass sie Ähnlichkeit nur anhand von Eigenschaften berechnen, die vom Inhalt der gegebenen Texte abgeleitet werden können, und somit implizieren, dass jegliche andere Textcharacteristika vernächlässigbar sind. In dieser Arbeit berechnen wir Textähnlichkeit anhand von drei Dimensionen: Inhalt, Struktur und Stil. Wir untersuchen mögliche Kombinationen von Maßen entlang dieser Dimensionen, und zeigen deutlich anhand der Ergebnisse auf drei etablierten Evaluationsdatensätzen, dass die Komposition generell bessere Ergebnisse liefert als bestehende Ansätze, und dass die Bestimmung von Textwiederverwendung stark von einem breiten Spektrum an Textcharacteristika profitiert.
منابع مشابه
The Short Stories Corpus: Notebook for PAN at CLEF 2015
In this work we describe the construction of a plagiarism detection/text reuse corpus submitted for the PAN-2015 Evaluation Lab. Our corpus consists of four different text reuse scenarios namely, (1) no-plagiarism, (2) story-retelling, (3) synonym-replacement and (4) character-substitution. Among these scenarios the most interesting one is story retelling through it we find patterns of textual ...
متن کاملCOUNTER: corpus of Urdu news text reuse
Text reuse is the act of borrowing text from existing documents to create new texts. Freely available and easily accessible large online repositories are not only making reuse of text more common in society but also harder to detect. A major hindrance in the development and evaluation of existing/new mono-lingual text reuse detection methods, especially for South Asian languages, is the unavail...
متن کاملDeveloping Bilingual Plagiarism Detection Corpus Using Sentence Aligned Parallel Corpus: Notebook for PAN at CLEF 2015
Plagiarism detection is the process of locating text reuse within a suspicious document. The plagiarism detection corpora are used for evaluating plagiarism detection systems. In this paper, we present a bilingual PersianEnglish plagiarism detection corpus. We provide our corpus for the task of text alignment corpus construction in the PAN 2015 competition. Our approach is based on parallel cor...
متن کاملDynamically Adjustable Approach through Obfuscation Type Recognition
The task of (monolingual) text alignment consists in finding similar text fragments between two given documents. It has applications in plagiarism detection, detection of text reuse, author identification, authoring aid, and information retrieval, to mention only a few. We describe our approach to the text alignment subtask of the plagiarism detection competition at PAN 2015. Our method relies ...
متن کاملPlagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting
With due respect to the authors’ rights, plagiarism detection, is one of the critical problems in the field of text-mining that many researchers are interested in. This issue is considered as a serious one in high academic institutions. There exist language-free tools which do not yield any reliable results since the special features of every language are ignored in them. Considering the paucit...
متن کامل